runtime error
QCoder Benchmark: Bridging Language Generation and Quantum Hardware through Simulator-Based Feedback
Mikuriya, Taku, Ishigaki, Tatsuya, Kawarada, Masayuki, Minami, Shunya, Kadowaki, Tadashi, Suzuki, Yohichi, Naito, Soshun, Takata, Shunya, Kato, Takumi, Basseda, Tamotsu, Yamada, Reo, Takamura, Hiroya
Large language models (LLMs) have increasingly been applied to automatic programming code generation. This task can be viewed as a language generation task that bridges natural language, human knowledge, and programming logic. However, it remains underexplored in domains that require interaction with hardware devices, such as quantum programming, where human coders write Python code that is executed on a quantum computer. To address this gap, we introduce QCoder Benchmark, an evaluation framework that assesses LLMs on quantum programming with feedback from simulated hardware devices. Our benchmark offers two key features. First, it supports evaluation in a quantum simulator environment beyond conventional Python execution, allowing feedback on domain-specific metrics such as circuit depth, execution time, and error classification, which can be used to guide better generation. Second, it incorporates human-written code submissions collected from real programming contests, enabling both quantitative comparisons and qualitative analyses of LLM outputs against human-written code. Our experiments reveal that even advanced models like GPT-4o achieve only around 18.97% accuracy, highlighting the difficulty of the benchmark. In contrast, reasoning-based models such as o3 reach up to 78% accuracy, outperforming the average success rate of human-written code (39.98%). We release the QCoder Benchmark dataset and a public evaluation API to support further research. (Code and datasets are available at https://qcoder-bench.github.io/.)
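As a concrete illustration of what simulator-based feedback can look like, here is a minimal sketch using Qiskit and its Aer simulator. The grade_submission helper and the depth limit are hypothetical illustrations of the kind of metrics the abstract describes, not the benchmark's actual API.

    # Sketch only: assumes qiskit and qiskit-aer are installed.
    from qiskit import QuantumCircuit, transpile
    from qiskit_aer import AerSimulator

    def grade_submission(circuit: QuantumCircuit, max_depth: int = 10) -> dict:
        """Run a candidate circuit on a simulator and return feedback metrics."""
        sim = AerSimulator()
        compiled = transpile(circuit, sim)
        result = sim.run(compiled, shots=1024).result()
        return {
            "depth": compiled.depth(),                     # domain-specific metric
            "within_depth_limit": compiled.depth() <= max_depth,
            "counts": result.get_counts(),                 # measurement outcomes
        }

    # Example: a Bell-state circuit standing in for an LLM-generated submission.
    qc = QuantumCircuit(2, 2)
    qc.h(0)
    qc.cx(0, 1)
    qc.measure([0, 1], [0, 1])
    print(grade_submission(qc))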
Benchmark Dataset Generation and Evaluation for Excel Formula Repair with LLMs
Singha, Ananya, Sahijwani, Harshita, Williams, Walt, Boateng, Emmanuel Aboah, Hausman, Nick, Di Luca, Miguel, Choudhury, Keegan, Binet, Chaya, Le, Vu, Chen, Tianwei, Chen, Oryan Rokeah, Vesal, Sulaiman, Hasan, Sadid
Excel is a pervasive yet often complex tool, particularly for novice users, for whom runtime errors arising from logical mistakes or misinterpretations of functions pose a significant challenge. While large language models (LLMs) offer promising assistance by explaining formula errors, the automated correction of these semantic runtime errors remains an open problem. A primary challenge to advancing models for such scenarios is the severe lack of high-quality, comprehensive datasets for training and rigorous evaluation. This paper addresses this gap by introducing a novel approach for constructing a benchmark dataset specifically designed for Excel formula repair. We propose a data generation pipeline that leverages a small set of curated seed samples from online forums to synthetically expand the dataset. Our pipeline integrates few-shot prompting with LLMs and employs a robust LLM-as-a-Judge validation framework, combined with execution-based checks, to ensure the correctness and semantic fidelity of the generated data. This process produced a benchmark dataset of 618 high-quality samples covering common runtime errors. Furthermore, we propose a context-aware baseline technique for Excel formula repair that prompts LLMs with both the faulty formula and relevant spreadsheet context. We evaluate the performance of various LLMs (GPT-4o, GPT-4.1, Phi-3, Mistral) on our newly generated benchmark using execution-based metrics. Our analysis demonstrates the dataset's quality through manual annotation and provides insights into error and function distributions. The proposed generation methodology is highly scalable and can be readily adapted to create evaluation benchmarks for similar code repair tasks in other low-resource programming languages.
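To make the pipeline concrete, here is a hedged sketch of the generate-and-validate loop the abstract describes. The three injected callables (generator, judge, and formula execution engine) are stand-ins; the paper's actual prompts, judge rubric, and execution harness are not reproduced here. execute_formula is assumed to return the computed value on success and None on a runtime error.

    from dataclasses import dataclass

    @dataclass
    class Candidate:
        broken: str  # formula expected to reproduce the seed's runtime error
        fixed: str   # proposed repair

    def synthesize_sample(seed, context, llm_complete, llm_judge, execute_formula):
        """Expand one curated seed into a validated (broken, fixed) pair."""
        candidate = llm_complete(seed, context)  # few-shot prompting step
        if not llm_judge(seed, candidate):       # LLM-as-a-Judge gate
            return None
        # Execution-based checks: the repair must evaluate cleanly, and the
        # broken formula must actually fail as reported.
        if execute_formula(candidate.fixed, context) is None:
            return None
        if execute_formula(candidate.broken, context) is not None:
            return None
        return candidate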
MultiAIGCD: A Comprehensive Dataset for AI-Generated Code Detection Covering Multiple Languages, Models, Prompts, and Scenarios
Demirok, Basak, Kutlu, Mucahid, Mergen, Selin
As large language models (LLMs) rapidly advance, their role in code generation has expanded significantly. While this offers streamlined development, it also creates concerns in areas like education and job interviews. Consequently, developing robust systems to detect AI-generated code is imperative to maintain academic integrity and ensure fairness in hiring processes. In this study, we introduce MultiAIGCD, a dataset for AI-generated code detection for Python, Java, and Go. From the CodeNet dataset's problem definitions and human-authored code, we generate code samples in Java, Python, and Go with six different LLMs and three different prompts. This generation process covers three key usage scenarios: (i) generating code from problem descriptions, (ii) fixing runtime errors in human-written code, and (iii) correcting incorrect outputs. Overall, MultiAIGCD consists of 121,271 AI-generated and 32,148 human-written code snippets. We also benchmark three state-of-the-art AI-generated code detection models and assess their performance in various test scenarios, such as cross-model and cross-language. We share our dataset and code to support research in this field.
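For illustration, the three usage scenarios might be turned into prompts along these lines; the exact wording used for MultiAIGCD is an assumption, not quoted from the paper.

    def build_prompt(scenario: str, problem: str, human_code: str = "",
                     language: str = "Python") -> str:
        """Assemble a generation prompt for one of the three scenarios."""
        if scenario == "from_description":
            return f"Solve this problem in {language}:\n{problem}"
        if scenario == "fix_runtime_error":
            return (f"This {language} solution raises a runtime error. Fix it.\n"
                    f"Problem:\n{problem}\nCode:\n{human_code}")
        if scenario == "fix_wrong_output":
            return (f"This {language} solution produces wrong output. Correct it.\n"
                    f"Problem:\n{problem}\nCode:\n{human_code}")
        raise ValueError(f"unknown scenario: {scenario}")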
Towards LLM-based Root Cause Analysis of Hardware Design Failures
Qiu, Siyu, Wang, Muzhi, Afsharmazayejani, Raheel, Shahmiri, Mohammad Moradi, Tan, Benjamin, Pearce, Hammond
With advances in large language models (LLMs), new opportunities have emerged to develop tools that support the digital hardware design process. In this work, we explore how LLMs can assist with explaining the root cause of design issues and bugs that are revealed during synthesis and simulation, a necessary milestone on the pathway towards widespread use of LLMs in the hardware design process and for hardware security analysis. We find promising results: for our corpus of 34 different buggy scenarios, OpenAI's o3-mini reasoning model reached a correct determination 100% of the time under pass@5 scoring, with other state-of-the-art models and configurations usually achieving more than 80%, and more than 90% when assisted with retrieval-augmented generation.

Encountering bugs, glitches, and faults is a normal part of the digital hardware design lifecycle. Ensuring they are completely removed and repaired is a time-consuming process that requires a deep understanding of both the technical cause of the issue and any impacts on the broader hardware system, particularly since any missed repair may have severe downstream functional and/or security consequences [1] (if the bug is of an exploitable nature). However, as digital hardware grows in complexity, so do the frequency and nature of the bugs themselves.
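The pass@5 figure above uses pass@k-style scoring. For reference, the standard unbiased estimator from the Codex paper (Chen et al., 2021) computes, for n samples of which c are correct, pass@k = 1 - C(n-c, k) / C(n, k); a minimal implementation:

    from math import comb

    def pass_at_k(n: int, c: int, k: int) -> float:
        """Probability that at least one of k samples drawn from n is correct."""
        if n - c < k:
            return 1.0  # too few incorrect samples to fill all k draws
        return 1.0 - comb(n - c, k) / comb(n, k)

    # e.g. 5 attempts, 5 correct -> pass@5 = 1.0
    print(pass_at_k(n=5, c=5, k=5))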
AIGCodeSet: A New Annotated Dataset for AI Generated Code Detection
Demirok, Basak, Kutlu, Mucahid
With the rapid advancement of LLMs, these models have become widely useful in various fields. While these AI systems can be used for code generation, significantly simplifying and accelerating the tasks of developers, their use by students to complete assignments has raised ethical questions in the field of education. In this context, determining the author of a particular piece of code becomes important. In this study, we introduce AIGCodeSet, a dataset for AI-generated code detection tasks, specifically for the Python programming language. We obtain the problem descriptions and human-written code from the CodeNet dataset. Using the problem descriptions, we generate AI-written code with the CodeLlama 34B, Codestral 22B, and Gemini 1.5 Flash models in three approaches: i) generating code from the problem description alone, ii) generating code using the description along with human-written source code containing runtime errors, and iii) generating code using the problem description and human-written code that resulted in wrong answers. Lastly, we apply a post-processing step to eliminate LLM output irrelevant to the code snippets. Overall, AIGCodeSet consists of 2,828 AI-generated and 4,755 human-written code snippets. We share our code with the research community to support studies on this important topic and provide performance results for baseline AI-generated code detection methods.
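The post-processing step is the most mechanical part of such pipelines; a minimal sketch of one plausible approach is below. The regex heuristic (keep only the fenced code block from an LLM reply) is our assumption; the paper does not specify its exact procedure.

    import re

    def extract_python(llm_output: str) -> str:
        """Return the first fenced code block, or the raw text if none found."""
        match = re.search(r"```(?:python)?\s*\n(.*?)```", llm_output, re.DOTALL)
        return match.group(1).strip() if match else llm_output.strip()

    reply = "Here is a fix:\n```python\nprint(sum(map(int, input().split())))\n```\nHope this helps!"
    print(extract_python(reply))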
REDO: Execution-Free Runtime Error Detection for COding Agents
Li, Shou, Kan, Andrey, Callot, Laurent, Bhasker, Bhavana, Rashid, Muhammad Shihab, Esler, Timothy B
As LLM-based agents exhibit exceptional capabilities in addressing complex problems, there is a growing focus on developing coding agents to tackle increasingly sophisticated tasks. Despite their promising performance, these coding agents often produce programs or modifications that contain runtime errors, which can cause code failures and are difficult for static analysis tools to detect. Enhancing the ability of coding agents to statically identify such errors could significantly improve their overall performance. In this work, we introduce Execution-free Runtime Error Detection for COding Agents (REDO), a method that integrates LLMs with static analysis tools to detect runtime errors for coding agents, without code execution. Additionally, we propose a benchmark task, SWE-Bench-Error-Detection (SWEDE), based on SWE-Bench (lite), to evaluate error detection in repository-level problems with complex external dependencies. Finally, through both quantitative and qualitative analyses across various error detection tasks, we demonstrate that REDO outperforms current state-of-the-art methods by achieving an 11.0% higher accuracy and a 9.1% higher weighted F1 score, and we provide insights into the advantages of incorporating LLMs for error detection.

Large language models (LLMs) and LLM-based agents have exhibited significant potential in code generation, code editing, and code evaluation. This progress has culminated in the development of advanced LLM-based agents (hereafter referred to as coding agents) designed to address increasingly complex tasks. For example, SWE-Bench (Jimenez et al., 2024a) presents a demanding benchmark comprising repository-level coding challenges. This benchmark requires coding agents to generate a modification patch that solves a given problem within a GitHub repository, based on a problem statement expressed in natural language. To effectively navigate complex tasks such as those posed by SWE-Bench, coding agents must demonstrate proficiency in the following core competencies: 1) comprehension of the problem statement and retrieval of relevant code, 2) reasoning towards a functionally correct solution, and 3) generation of programs free from runtime errors such as SyntaxError, AttributeError, or TypeError. While the majority of coding agents across different tasks focus on enhancing comprehension, retrieval, and reasoning capabilities, the systematic detection of runtime errors has received comparatively limited attention. However, ensuring that generated code is free from runtime errors is as critical as the aforementioned capabilities. For example, an AttributeError can cause the modified code to fail, irrespective of the agent's comprehension and reasoning processes.
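A hedged sketch of the two-stage idea: run cheap static checks first, then ask an LLM about dynamic errors that static analysis cannot see. The query_llm callable is a stand-in; REDO's actual tool suite and prompts are not reproduced here.

    import ast

    def detect_runtime_errors(source: str, query_llm) -> str:
        # Stage 1: static analysis. ast.parse catches SyntaxError without
        # executing anything; real tools (e.g., pyflakes) catch more classes.
        try:
            ast.parse(source)
        except SyntaxError as err:
            return f"SyntaxError at line {err.lineno}: {err.msg}"
        # Stage 2: consult an LLM about errors such as AttributeError or
        # TypeError that static tools miss, still without running the code.
        return query_llm(
            "Does this patch introduce a runtime error? Reply with the "
            "error type or 'none'.\n" + source
        )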
Revisiting VerilogEval: Newer LLMs, In-Context Learning, and Specification-to-RTL Tasks
Pinckney, Nathaniel, Batten, Christopher, Liu, Mingjie, Ren, Haoxing, Khailany, Brucek
The application of large-language models (LLMs) to digital hardware code generation is an emerging field. Most LLMs are primarily trained on natural language and software code. Hardware code, such as Verilog, represents only a small portion of the training data, and few hardware benchmarks exist. To address this gap, the open-source VerilogEval benchmark was released in 2023, providing a consistent evaluation framework for LLMs on code completion tasks. It was tested on state-of-the-art models at the time, including GPT-4. However, VerilogEval and other Verilog generation benchmarks lack failure analysis and, in their present form, are not conducive to exploring prompting techniques. Also, since VerilogEval's release, both commercial and open-source models have seen continued development. In this work, we evaluate new commercial and open-source models of varying sizes against an improved VerilogEval benchmark suite. We enhance VerilogEval's infrastructure and dataset by automatically classifying failures, introduce new prompts for supporting in-context learning (ICL) examples, and extend the supported tasks to specification-to-RTL translation. We find a measurable improvement in commercial state-of-the-art models, with GPT-4 Turbo achieving a 59% pass rate on spec-to-RTL tasks. We also study the performance of open-source and domain-specific models that have emerged, and demonstrate that models can benefit substantially from ICL. We find that the recently released Llama 3.1 405B achieves a pass rate of 58%, effectively matching that of GPT-4 Turbo, and that the much smaller domain-specific RTL-Coder 6.7B models achieve an impressive 37% pass rate. However, prompt engineering is key to achieving good pass rates, and the best prompts vary widely with model and task. A benchmark infrastructure that allows for prompt engineering and failure analysis is key to continued model development and deployment.
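As an example of the kind of ICL prompt the revised suite supports, a spec-to-RTL prompt might be assembled as below; the one-shot example and wording are our own illustration, not the prompts shipped with the benchmark.

    # Hypothetical one-shot example for in-context learning.
    ICL_EXAMPLE = (
        "Spec: a 2-to-1 multiplexer.\n"
        "module mux2(input a, input b, input sel, output y);\n"
        "  assign y = sel ? b : a;\n"
        "endmodule"
    )

    def spec_to_rtl_prompt(spec: str, n_shots: int = 1) -> str:
        """Prepend n in-context examples to a specification-to-RTL request."""
        shots = "\n\n".join([ICL_EXAMPLE] * n_shots)
        return ("Translate each specification into synthesizable Verilog.\n\n"
                f"{shots}\n\nSpec: {spec}\n")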
LLM as Runtime Error Handler: A Promising Pathway to Adaptive Self-Healing of Software Systems
Sun, Zhensu, Zhu, Haotian, Xu, Bowen, Du, Xiaoning, Li, Li, Lo, David
Unanticipated runtime errors, lacking predefined handlers, can abruptly terminate execution and lead to severe consequences, such as data loss or system crashes. Despite extensive efforts to identify potential errors during the development phase, such unanticipated errors remain challenging to eliminate entirely, making runtime mitigation measures indispensable for minimizing their impact. Automated self-healing techniques, such as reusing existing handlers, have been investigated to reduce the loss caused by execution termination. However, the usability of existing methods is limited by their predefined heuristic rules, and they fail to handle diverse runtime errors adaptively. Recently, the advent of Large Language Models (LLMs) has opened new avenues for addressing this problem. Inspired by their remarkable capabilities in understanding and generating code, we propose to handle runtime errors in real time using LLMs. Specifically, we propose Healer, the first LLM-assisted self-healing framework for handling runtime errors. When an unhandled runtime error occurs, Healer is activated to generate a piece of error-handling code with the help of its internal LLM, and the code is executed inside the runtime environment owned by the framework to obtain a rectified program state from which the program can continue its execution. Our exploratory study evaluates the performance of Healer using four different code benchmarks and three state-of-the-art LLMs: GPT-3.5, GPT-4, and CodeQwen-7B. Results show that, without the need for any fine-tuning, GPT-4 can successfully help programs recover from 72.8% of runtime errors, highlighting the potential of LLMs in handling runtime errors.
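A minimal sketch of the Healer idea: on an unhandled exception, ask an LLM for state-rectifying code, execute it, and retry. The llm_generate callable and the toy repair are stand-ins; Healer's real prompts and sandboxing differ.

    def run_with_healing(func, state: dict, llm_generate):
        try:
            return func(state)
        except Exception as err:
            # Ask the LLM for a snippet that rectifies `state` so the
            # computation can continue from a valid program state.
            handler_code = llm_generate(
                f"Runtime error: {type(err).__name__}: {err}. "
                "Write Python that fixes the dict `state` so the call can retry."
            )
            exec(handler_code, {"state": state})  # framework-owned environment
            return func(state)  # retry from the rectified state

    # Toy usage: a stubbed 'LLM' supplies a missing key after a KeyError.
    stub = lambda _prompt: "state['divisor'] = 1"
    print(run_with_healing(lambda s: 10 / s["divisor"], {}, stub))  # -> 10.0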
Evaluating AI-generated code for C++, Fortran, Go, Java, Julia, Matlab, Python, R, and Rust
Diehl, Patrick, Nader, Noujoud, Brandt, Steve, Kaiser, Hartmut
This study evaluates the capabilities of ChatGPT versions 3.5 and 4 in generating code across a diverse range of programming languages. Our objective is to assess the effectiveness of these AI models for generating scientific programs. To this end, we asked ChatGPT to generate three distinct codes: a simple numerical integration, a conjugate gradient solver, and a parallel 1D stencil-based heat equation solver. The focus of our analysis was on the compilation, runtime performance, and accuracy of the codes. While both versions of ChatGPT successfully created codes that compiled and ran (with some help), some languages were easier for the AI to use than others (possibly because of the size of the training sets used). Parallel codes, even the simple example we chose to study here, were also difficult for the AI to generate correctly.
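For context, the first of the three tasks (simple numerical integration) could be satisfied by something as short as the trapezoidal-rule sketch below; this is our own reference version, not code generated by ChatGPT in the study.

    def trapezoid(f, a: float, b: float, n: int = 1000) -> float:
        """Approximate the integral of f over [a, b] with n trapezoids."""
        h = (b - a) / n
        total = 0.5 * (f(a) + f(b))
        for i in range(1, n):
            total += f(a + i * h)
        return total * h

    # Integral of x^2 over [0, 1] is 1/3.
    print(trapezoid(lambda x: x * x, 0.0, 1.0))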
Deep Learning With Weighted Cross Entropy Loss On Imbalanced Tabular Data Using FastAI
The dataset comes from the context of ad conversions, where the binary target values 1 and 0 correspond to conversion success and failure. This proprietary dataset (no, I don't own the rights) has some particularly interesting attributes due to its dimensions, its class imbalance, and the rather weak relationship between the features and the target variable. First, the dimensions of the data: this tabular dataset contains a fairly large number of records and categorical features with very high cardinality. Note: in FastAI, categorical features are represented using embeddings, which can improve classification performance on high-cardinality features. Second, the binary class labels are highly imbalanced, since successful ad conversions are relatively rare.
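A minimal sketch of the approach the post builds toward: pass class weights, roughly inverse to class frequency, into the loss via FastAI's CrossEntropyLossFlat. The column names, file name, and the 1:99 imbalance here are hypothetical stand-ins for the proprietary data.

    import torch
    from fastai.tabular.all import (TabularDataLoaders, tabular_learner,
                                    CrossEntropyLossFlat, Categorify,
                                    Normalize, CategoryBlock)

    dls = TabularDataLoaders.from_csv(
        "conversions.csv", y_names="converted", y_block=CategoryBlock(),
        cat_names=["publisher_id", "campaign_id"],  # high cardinality -> embeddings
        cont_names=["bid", "hour"],
        procs=[Categorify, Normalize])

    # Weight each class inversely to its frequency (assume ~1% conversions).
    weights = torch.tensor([1.0, 99.0])
    learn = tabular_learner(dls, loss_func=CrossEntropyLossFlat(weight=weights))
    learn.fit_one_cycle(3)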